Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages
Authors
Abstract
A considerable amount of work has gone into the development of stemmers and morphological analysers. Most of these approaches use hand-crafted suffix-replacement rules, but a few try to discover such rules from corpora. While most approaches remove or replace suffixes, there are also examples of derivational stemmers that are based on prefixes. In this paper we present a rule-based morphological analyser and propose an approach that takes both prefixes and suffixes into account. Given a corpus and a dictionary, our method can be used to obtain a set of suffix-replacement rules for deriving an inflected word’s root form. We developed the approach for Hindi but show that it is portable, at least to related languages, by adapting it to Gujarati. Because the entire process of developing such a ruleset is simple and fast, our approach can be used for rapid development of morphological analysers, yet it obtains results competitive with analysers built on human-authored rules.
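As a rough illustration of what obtaining suffix-replacement rules from a corpus and a dictionary could look like, the following Python sketch pairs each corpus word with the dictionary root that shares its longest prefix and records the differing endings as a rule. This is not the authors' algorithm: the function names, the `min_stem` threshold, the rule ordering, and the romanized toy words are all assumptions made for illustration, and prefix handling is left out entirely.

```python
from collections import Counter


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_rules(corpus_words, dictionary, min_stem=2):
    """Count (inflected_suffix, root_suffix) pairs (hypothetical helper).

    Each corpus word that is not itself a root is paired with the
    dictionary root sharing its longest prefix; the two differing
    endings form one suffix-replacement rule.
    """
    counts = Counter()
    for word in set(corpus_words):
        if word in dictionary:
            continue
        best_root, best_k = None, 0
        for root in dictionary:
            k = common_prefix_len(word, root)
            if k > best_k:
                best_root, best_k = root, k
        # Require a minimum shared stem and a non-empty inflected suffix.
        if best_root is not None and min_stem <= best_k < len(word):
            counts[(word[best_k:], best_root[best_k:])] += 1
    return counts


def find_root(word, rules, dictionary):
    """Apply learned rules, longest suffix first, validating against the dictionary."""
    if word in dictionary:
        return word
    ordered = sorted(rules.items(), key=lambda kv: (-len(kv[0][0]), -kv[1]))
    for (old, new), _count in ordered:
        if word.endswith(old) and len(word) > len(old):
            candidate = word[:len(word) - len(old)] + new
            if candidate in dictionary:
                return candidate
    return word  # no rule applied; leave the word unchanged


# Toy romanized Hindi data (assumed for illustration, not from the paper).
training_roots = {"ladka", "kitab", "ghar"}
corpus = ["ladke", "ladkon", "kitaben", "gharon"]
rules = learn_rules(corpus, training_roots)

# A larger lexicon containing roots that were not seen during rule learning.
lexicon = training_roots | {"kamra", "mez"}
roots = [find_root(w, rules, lexicon) for w in ["kamre", "mezen", "gharon"]]
```

Checking each candidate against the dictionary is one simple way to keep over-general rules from producing non-words; a real system would also need to handle script-specific issues that this romanized toy example sidesteps.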
Similar resources
Introduction to Gujarati wordnet
Gujarati is one of the 22 official languages of India. It is an Indo-Aryan language descended from Sanskrit. Gujarati wordnet is being built using the expansion approach with Hindi as the source language. This paper describes the experience of building Gujarati wordnet. It discusses basic features of the Gujarati language and evaluates the suitability of Hindi for the expansion approach. Various issue...
Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology
This paper investigates the possibilities that cross-linguistic similarities and dissimilarities between related languages offer in terms of bootstrapping a morphological analyser. In this case an existing Zulu morphological analyser prototype (ZulMorph) serves as basis for a Xhosa analyser. The investigation is structured around the morphotactics and the morphophonological alternations of the ...
Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition
This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system built for Named Entity Recognition for South and South East Asian Languages. Our paper combines machine learning techniques with language-specific heuristics to model the problem of NER for Indian languages. The system has been tested on five languages: Telugu, Hindi, Bengali, Urdu and Oriya. It uses CRF (Co...
Proceedings of the IJCAI – 2007 Workshop On Shallow Parsing for South Asian
As part of the IJCAI workshop on "Shallow Parsing for South Asian Languages", a contest was held in which the participants trained and tested their shallow parsing systems for Hindi, Bengali and Telugu. This paper gives a complete account of the contest in terms of how the data for the three languages was released, the performances of the participating systems and an overview of the approache...
Aligning Sentences and Words Using English-Hindi Bilingual Parallel Corpora
This dissertation project relates to language engineering issues. The Enabling Minority Language Engineering (EMILLE) project is a collaborative work of the University of Sheffield and Lancaster University. It aims to develop a sixty-three-million-word electronic corpus of the South Asian languages. As part of the EMILLE project, it was decided to develop a POS tagger for one of the languages...